{
 "metadata": {
  "name": "making-groundtruth"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from pylab import *"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Making Groundtruth"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Creating ground truth for OCR training from scratch is usually\n",
      "a fairly laborious process.\n",
      "However, we can create some ground truth fairly automatically.\n",
      "\n",
      "There are many existing OCR systems. Even if their error rates\n",
      "may be higher than that of OCRopus, they can still help us get\n",
      "a larger variety of training data, in particular for initial\n",
      "training.\n",
      "\n",
      "Let's demonstrate this with the Tesseract OCR system.\n",
      "We first as Tesseract to produce output in hOCR format."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!tesseract -l deu-frak 0211.bin.png 0211 hocr"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Tesseract Open Source OCR Engine v3.02 with Leptonica\r\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As you can see in the output, there is information about bounding\n",
      "boxes and locations of text lines and words."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!fmt 0211.html | sed '20,30!d' "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2709 805 2906 885\">wuthe</span> <span class='ocr_word' id='word_13'\r\n",
        "title=\"bbox 2948 803 3266 881\">(wadete)</span> <span class='ocr_word'\r\n",
        "id='word_15' title=\"bbox 3308 799 3445 871\">\u00fcber</span> </span>\r\n",
        "<span class='ocr_line' id='line_3' title=\"bbox 1542 915 3444\r\n",
        "1005\"><span class='ocr_word' id='word_17' title=\"bbox 1542 925 1655\r\n",
        "989\">alle</span> <span class='ocr_word' id='word_19' title=\"bbox 1688\r\n",
        "927 1954 1005\">Wasser,</span> <span class='ocr_word' id='word_21'\r\n",
        "title=\"bbox 2000 922 2155 997\">dorst</span> <span class='ocr_word'\r\n",
        "id='word_23' title=\"bbox 2198 922 2524 998\">(braucht)</span> <span\r\n",
        "class='ocr_word' id='word_25' title=\"bbox 2564 923 2702 992\">\u00fcber</span>\r\n",
        "<span class='ocr_word' id='word_27' title=\"bbox 2753 926 2905\r\n"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This OCR gets quite a bit wrong, but we only care about getting some training\n",
      "data, not transcribing the entire document."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!lynx -dump 0211.html | sed '20,30!d'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\r\n",
        "   Feinde, Winden und Hunnen, meinten, es w\u00e4r der leidige Teufel.\r\n",
        "\r\n",
        "   19. Riesen\u00bb S\u00e4ulen.\r\n",
        "\r\n",
        "   Wi n ketiiia n n\u2019 s hessifche Chronik. S. z2. M el i s s a n t e s in\r\n",
        "   0rqgraph. bei Maichene Berg.\r\n",
        "\r\n",
        "   Bei Miltenberg oder Kleinen-Haubaih auf einein hohen Geb\u00fcrg ini Walde\r\n",
        "   sind neun geivaltige, gro\u00dfe,\r\n"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To get good training data, we just focus on output that is correctly spelled.\n",
      "\n",
      "The `hocr-xwords` script reads such hOCR files and takes a spelling dictionary.\n",
      "It then goes through the hOCR output produced by the OCR engine, picks out\n",
      "sequences of words that are spelled correctly (some common punctuation is allowed),\n",
      "and writes out both the text (as ground truth) and the corresponding image\n",
      "in a format suitable for training."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!hocr-xwords -w 400 -d de_DE 0211.html"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "page_1 0211.bin.png\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "(7009, 4959) (7009, 4959)\r\n",
        "alle Wasser,\r\n",
        "keine Br\u00fccke gehen,\r\n",
        "nieder, h\u00e4ngt\r\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "gute Gesellen\r\n",
        "sieben oder acht\r\n",
        "viel Volks wider solche Kr\u00f6ten\r\n",
        "Riesen nennt\r\n",
        "Feinde, Winden\r\n",
        "Hunnen, meinten,\r\n",
        "leidige Teufel.\r\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here is an example of what this looks like."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import codecs\n",
      "print repr(codecs.open(\"0211/010001.gt.txt\",\"r\",\"utf-8\").read())\n",
      "imshow(imread(\"0211/010001.bin.png\"),cmap=cm.gray)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "u'alle Wasser,'\n"
       ]
      },
      {
       "output_type": "pyout",
       "prompt_number": 16,
       "text": [
        "<matplotlib.image.AxesImage at 0x46fbb10>"
       ]
      },
      {
       "output_type": "display_data",
       "png": "iVBORw0KGgoAAAANSUhEUgAAAWwAAABfCAYAAADMBv0OAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJztnXdUFdfWwPeliFIERQEFQq+CFxTFLgjiwxrRGMAC2GKJ\nUd7TiL4YLE/FEgJKLHkR1CyVGGMlaoAggi8iz4gNCyKgIEQpohSRtr8//O59XO70O3BB57fWWQtm\nztlnz56Zfc+cso8IEREEBAQEBDo8KspWQEBAQECAGYLDFhAQEOgkCA5bQEBAoJMgOGwBAQGBToLg\nsAUEBAQ6CYLDFhAQEOgkcHbYFy9eBHt7e7CxsYFt27bxqZOAgICAAAEiLvOwm5qawM7ODpKTk8HY\n2BgGDRoEx44dAwcHh7bQUUBAQEAAOLawMzMzwdraGszNzUFdXR38/f3hzJkzfOsmICAgINACNS6F\nnj17BqamptL/TUxM4Nq1azJ5RCKRYpoJCAgIfKCQdXxwamEzdcaI2OFTQUEBnDp1CoyNjWV0HzFi\nBDQ0NHCSWV1dDcuXL1fYDuHh4Uq3j6KppKQETp06BaWlpUrT4X2wo7JTSxumpaXBqVOn5JKydezo\nielzSAWnFraxsTEUFhZK/y8sLAQTExMuopROUlISLFiwQPq/vb09zJkzB2bOnAlqavLmWbt2LTg5\nOUFgYCCpzJUrV8K+ffvaRN/OxNq1ayE3Nxd+/vlnCAwMBDMzMzA2NoalS5cqWzWIiIiA169fS/9f\nunSp3I/2+8Tly5fBwMAATp48CTU1NTLnunTpAuvXr2ck5+LFi7Bw4UKZ91/CmjVroH///uDv78+H\nygJEIAcaGhrQ0tIS8/Pz8e3btygWi/HevXsyeTiKblfy8/PRwcEBAUCaJk6cSFkGANDAwABPnDhB\neN7Pz09GniJ2CA8P51y2I9DaDgCA3bt3Ry8vL0xJSeGtnokTJ6KXlxd+++23hOdb23HDhg2ooaEh\no9egQYNw3LhxvOnUkbh9+zZaW1tj//79UV1dXe6eqKqqopeXlzTV1tZK//7tt98QETEsLAy9vLzQ\nzMyM8L5KkqGhIZ48eVLJV9wxYfo+U/kMzt7k/PnzaGtri1ZWVrhlyxZWlbY3w4cPx969e8sdv3v3\nrszD1qNHD6ytraWUJckbGRkpd87f3x9FIpGMzKtXr3LW+9KlS5zLKoN79+6hjo4OHj58GBGJHbYk\nde3aFZ8+fcpLvdra2ggAqK6ujjo6OpibmytzvrUdP/nkE1K9dHR0cP369bzo1RFobGxETU1NynvR\nOknsKblPOjo62K1bN8blNTQ08MaNG8q+9A4H0/eZs8MOCQlBAwMDdHJykh4rLy9Hb29vtLGxwbFj\nx+LLly9ZV9reuLq6Sh+mRYsWYXZ2NmZnZ3NqCdvY2CAA4OrVqzE7OxsREWtqavDvf/+7nLw+ffrg\nw4cP2/LSOgwPHjxADQ0NXLdunczx1NRUdHR0JHUaBgYGWFhYqHD9bO5laGgoYUuzZdqzZw82NjYq\nrJcyycvLY+Wo+U6tv7oFmMHZYaelpeGNGzdkHPaqVatw27ZtiIgYERGBq1evZl1pe9PSYVMlJhQV\nFcmUiYuLwzVr1sjJsrKy4vWzvyOTlJSE3bp1w7Vr15Lm2bp1KwYHBxPafdCgQRgXF4dxcXGcdWgp\nz8/PjzZ/aGgoTp8+nfJ5iImJ4ayPssnMzERLS0ulOmw1NTW8cOGCsk3R6eDssBHf9fO2dNh2dnb4\n119/ISJiSUkJ2tnZsa60vWlLh02UevfujcnJyW18VcrnwYMHuGTJEvzoo48Y22/16tWUtlu6dCkn\nXVrKyMnJYVSmqqoK586dS6nPhg0bOOmjbEJDQ5XqrCUO+/Tp08o2RaeD6l1iPUvk+fPnYGhoCAAA\nhoaG8Pz5c9K8LUeePTw8wMPDg211nZKUlBRwcnJSthptTlFREezZs4dVma+//hqamppg586dhOf3\n7NkDdXV18MMPP/ChIiXa2towduxYiI2NJc1z7Ngx+Prrr9tcl/eRa9euwYABA5StRocnNTUVUlNT\nmWWm8/atW9h6enoy53v06MH6V6K9ae8W9odCYmKi9JpVVVUZl9u9e7fc4GzLNGLECGxqamKlS8vy\nTFvYEpYvX06pT1BQEDY3N7OSqUyOHz8uo79IJEJVVVW5RHTNKioqpHnI5EhSSxnXr19Xthk6LVQ+\nhPXCGUNDQ/jrr78AAKCkpAQMDAzYiuiQuLi48CLHzc2NFzkdnTdv3oCPj4/0//LycsZlP//8c/j0\n009Jz1+5cgXWrl2rkH5siIqKgoEDB5KeP3ToULu0+NsCkUgEn332GTQ2NsqliIgIGDp0qEy6evWq\nTJ45c+YAAIC6ujqsWrWKUI4kSWTEx8dT2lOAO6y7RCZPngyHDh2C1atXw6FDh+Djjz9uC73anfHj\nx/Mi5/z587zIed/529/+Br/++itUVVURnr9+/Trk5eWBpaUlray9e/fyrV6npbq6WuYZVFFRIbXP\nl19+CV9++SUjuXp6erRROf/44w/migpwgrKFHRAQAMOGDYOHDx+CqakpxMXFQVhYGCQlJYGtrS2k\npKRAWFhYe+napmzZskXZKnQqQkJCFCofFBQEPXv2JD3/+++/w507dxjJYup0PgRevXoFBw8eVLYa\nAm0EpcPevn072NnZgY2NDejq6kJVVRX07NkTjh8/DmZmZlBQUAAzZsyAysrK9tK3Q7Nv3z5KJ/Q+\n8dNPPyksg26gZfHixVBRUUGZJyQkBGpraxXWRYCaiooKWLRokbLV+OChdNjq6urw7bffQnZ2NmRk\nZMB3330H9+/fh4iICBg7dizk5OSAl5cXRERE8KZQRUWFXGoZ86Ejo62tDSoqH+YmPnp6eqzLmJub\nU54vKSmh7RsvLi6G5uZm1nW3pnv37grLeJ9pamqSjl0JKA9K72JkZCQdjNPW1gYHBwd49uwZnD17\nFoKCggDg3aft6dOneVHmypUroK+vL5cGDBgA9+7d4yx3+PDhvOhHx6xZs6CsrKxd6uqIXLlyhXeZ\ntra2pOdyc3N5s/fvv/9Oef7u3btQXV3NS13tCSLKhT4W6Lwwbg4WFBRAVlYWuLu7M56LvX79emmi\n+/w9d+4cTJs2jfDc48ePISQkhHGfZmt2797NqZwAO7h8MoeGhtLmIbt/R44cgRs3brCuk4yvvvqK\n9NyuXbvgyZMnvNXVXjQ3N8OqVat4kfXw4UP4z3/+w4ssgf+Rmpoq4yspYTIvsKqqCgcMGICnTp1C\nRGZzsRmKRsR3S5tNTU1p5ze7uLhgSUkJY7mt9aFK+/btYySHbh72ixcvOOlHx5w5c3DChAnSxHau\nMd8QXbuOjg5GR0ezksNkXru1tTVh2fXr1xPm52qbt2/fUuoxatQoTnLJuHTpEiYkJOCyZcvw7du3\nlEv7meLl5SWnd8+ePTkv+w8KCpKRFRERobCOdOTm5kqfczbv+6VLl6Tlpk6dSpv/9evXMu9UQ0OD\nImrzBpXvpPWq9fX16OPjIxO60s7OTmrI4uJiwuXpTB12Tk6OTHQwuvT48WNGcltTWFhIKff+/fuM\n5CjDYc+dOxdVVFRk6mn9o9nekF1/165dsVevXoydZlNTEx44cIBXh022mIsOOocNAGhpaclJdmue\nPHmCOjo6qK2tjWpqaqivr4/q6uq4detWheSS6d2tWzc8f/48a3mtHbaWlpZCESipmDp1Kvbq1Qv1\n9PRY30uJPSXlRCIRrdMuLy+XuTZ9fX0+LkNhqHwnZZcIIsK8efPA0dERVqxYIT0umYsNAArNxX7y\n5AnY2tqy6hu0srICAwMDyMvLg1evXjEup6+vT3qud+/eoK6uzlhWW1JRUQF5eXmQl5cHp06dApFI\nBLGxsXIDa5WVldClSxcoKSlRkqbE1NXVQVlZGdja2jLqQlBRUQEdHR1edXj58iWn7osuXbrApUuX\nKPOwWSBExpMnT8DMzAyqqqqguroaGhsboby8HBoaGmDNmjUQFxfHeiD1xYsXlJuIvHnzBurr61nJ\nfP36tdw8+ZqaGhg6dCjk5uaykkVGc3MzxMXFgUgkglOnTkFZWZnMrLOXL18yklNQUCCjKyLC8+fP\nScsXFhbK+YTy8nIYNGgQh6toR6g8fXp6OopEIhSLxeji4oIuLi544cIFLC8vRy8vL8oQqzSi8fLl\ny6xa1kRpx44djH6xEBFra2tJ5Rw4cICxnLZsYRcVFeHkyZNZ2cDDw4NzfYpAFVNakrS1tfHy5cu0\nslovpW6d2LawAYAw/jkTLl26RCrT19cXQ0JCOMmVkJ6eLtOCJEuxsbGsWsT+/v60MtkEYnr58iXO\nmzePUl5iYiIXEyDiu2iCx44dw9jYWFq96Th9+jRpWT8/PywuLpbJf+3aNezbty9hfjc3N87XxBdU\n10xpjTdv3uDgwYNRLBajg4MDhoWFISKzmNhklT569AhDQ0PRwsJCIWcN8K5PMT8/n5EROrrDrqqq\nYu2sAQBNTEwwISGBU52KQGXPlsnCwgKTkpIoZXF12FevXkU7O7s2ddgGBgYYGRmJkZGRWFZWxkmm\nhJSUFLS2tmZ8b/X19TE+Pp5WblpaGqkdWqbJkydjVVUVI11nzZpFK8/IyEg6rsWU58+fY2hoKDo5\nOTG2AxWxsbGopaVFWV7yw1dSUoKhoaHYr18/0rwdYccczg4b8V1wfsR324K5u7tjeno6o5jYRJUO\nHz6c0Y0aMmQIXrlyBQcOHEib19nZmVFgno7usAcPHiwny9HREa9cuYJXrlzB3r17k9bJx2AVW5qa\nmjA6OprRC7d582ZKWVwdNiKij48PYRl1dXXSWO1k1NfXyzyf58+fxz///JOVDDLu3LmD5ubmjJ1U\nSwdC94PM9D4wfT69vb0ZywsNDWVlhwEDBsiU19bWlj7jZO87FURb8hG9R8OHD2ccBO6LL75gdU18\no5DDllBTU4Nubm549+5dRjGxW1c6ZMgQRsaqr6+XjtY2NDRgfX09bZnO7rBb7yspSS1HrRsaGkjr\nVFFRwV9//ZV1vYqSnJzMi8NuamrCL774gleHDQA4ZcoUxtcyZMgQmV1o+I7jrKamJqdfQEAA1tfX\no4uLC6X9VFVV8cGDB6Sy+XTYvr6+jGVxcdity+vq6krPNTQ0EG5FZmJiQiqPicNmmzqyw6adh93c\n3AwuLi5gaGgInp6e0K9fP1YxsV++fAmBgYGQkZFBWc9HH30Eubm5oK6uLt2tXE1NDdTV1Xmda6so\nxsbGcPz4ccJzlpaWhDutUxEcHAz379+XO+7i4iIjS01NDUpKSghXBzY3N0Nubi40NTWxqltRvLy8\n4Pvvv1d4laCKigqoqqrypNX/SE1NpY2r0fL5bGhoAG1tbdi9ezdMmTKFFx1ycnJARUUFGhsbpcec\nnJwAEeHo0aOgrq4OWVlZUFdXB9bW1oQympqawN7enrSO3r17Q48ePRjpQxdds62eoby8PNrVsGpq\naoRhBlrarjUWFha8TxjYtWsXxMfH8yqTL2gdtoqKCty8eROKioogLS1NbhRdJBKBSCQiLLt+/XpY\nsGABHDt2jLIOOzs7OHLkCFhZWRGe79WrF4wYMYJO1XbDysoK+vfvL3c8MjKS8YtDx/Xr1+WOGRkZ\nwdatWwnzL1++nNWsGb5YsGABbN68mXLRTGZmJhQXF3OuIyAggFM5c3Nz2rC5v/32m8zz2bdvXxg8\neDCn+lpz9epV8PX1hXeNpnf4+PgQLj7R0NCAc+fOcZql4OTkRPruEEH3PvJNVlYWfPzxx4TPJ1EQ\nsenTp8v8X1tbC4mJiYSyd+7cyWjxFRucnZ1JfzzbAt4XzkjYuHEj7tixg/E87PLycvTw8KD9BGEy\n8h4TE9MhukQkEG3BxOUzuvU8VwDAsLAw0ms6duwY6XWUl5ezrp9PNm3aRKob3Y7RVFtaUUHVJQLw\nbtSfbE442fMpFovx5s2bXM0gZeXKlXKyf/nlF8oyWVlZ6OzszMoO2dnZhGMgZIlrFxNRYtIlcvPm\nTRSLxXJlN23aRJj/xo0bcnnp9ulkomvfvn3xwIEDeODAAezTp8/71yXSck7kmzdvICkpCVxdXRnP\nwx42bBjtknQLCwvYtGkTZZ4PjfHjx5N+tZARExMDurq6baQRM1atWgUbNmzgVWZaWhrl+f3794Om\npibp+ZEjR4KxsTHhuerqasLn89atW1BYWMhKz9YkJyfLdccEBweDt7c3ZTkXFxc4d+4caGtrM67L\n0dERTpw4AX379uWiqgz79+/nHAKCjMzMTMjLy5M7LvEhdPTq1Qu+/fZbyjwXL16klXP58mWYO3cu\nzJ07Fy5fvsyo7g4Hlae/ffs2urq6olgsRmdnZ9y+fTsiIuN52EySvb09o1+djtTCbmhowM8//1zh\nFvbevXsJ9UlLSyMt07qFraWlheHh4azqbQ9cXV1RS0sLtbS0GM2XX716tdxqTgBmg7gGBgaUzxjZ\nfOEnT55QlissLGR93YiIjx8/lpM1ZswYVjJatwC7detGW8bMzIz2fdPU1GRUf0ZGBi8t7MbGRtL5\n8nfv3iUs07KFzVTfvLw8Wl1bz7FuvdKRKI8yoHLLlCNkzs7OcOPGDWhqagI3NzdIS0uTBpJh2wJs\nK5SxyeepU6cgJiZGIRnV1dWQk5PDupxk8Bfg3SANWd+esmE7UBwREQGpqamcIss9f/6c8nn08fGB\nxsZGuYHNq1evUso1NTWFjIwMcHd3Z6xLRkYGDB06VO44XTTA1hQXF4O7uztkZmYCAHt7knHkyBFG\n+ZYtW6ZwXYgIu3fvpu+XbYWOjg7Y2tqCqqoqa7uRMXLkSNqvtc4Ao2h90dHR4OjoKH0p2jIeNlt2\n797dYX482JCXl0f7mUeEp6cnpKSkQEpKSod11sqAyw44TD6LJ02axHjbt4SEBJgwYYLcca474kh2\nc/f19aUMrcCGqVOnMsp39uxZmT07udDc3Cz9wWGDtbU1HDx4EI4cOQJ9+vRRSAcJXl5evMhRNrQO\nu6ioCM6fPw/z58+Xjna3VTzszkBFRQV88803ylbjvSMuLg4ePXokcywsLIzxlMGnT5+yrnPnzp3w\n97//nTJPaWkpnDhxgpE8IyMjuSlmW7Zsod0LkQxTU1P45ZdfYO/evdC7d29OMrhiZGTE2LmToaqq\nCgsXLiQ93zI+UWuGDh0Krq6uCtXfErat/I4KrcMODQ2FHTt2yOykwmYeNh9kZ2cTxir+6quv2n2X\n8jdv3rR5QPhp06a1+5xqZXPnzh2Z7cCWLFkCX3/9NWhoaDAqv3r1atZ1ampqMppGd/z4cTh69Cht\nPjc3N9DS0pI5FhUVJTOtj47o6Gjp1Nnu3buDn58fmJmZMS7PJ7Nnz5abYscna9asaTPZ7yuUDjsh\nIQEMDAzA1dWV9KGjmofNF/X19YT7RnbGrhAmlJaWgpqa2geze83PP/8s0z3k7u4OW7duhW7dujGW\nMXbsWNLdwdPS0kgX5vj7+8MXX3xBKbumpgZmzpxJO+OJCKZR9xARjh8/DitWrIAxY8bAw4cPWdWz\nZMkSKCoqYq0fFVpaWnI/QC2ZMGECREZGcpbv5eWl1G3HevbsCefOnVNa/RLYzMOmdNh//PEHnD17\nFiwsLCAgIABSUlJg9uzZYGhoKDV0SUkJ7eqptmLTpk2EC0zeFyZOnKhsFZRCbm4u6262169fw61b\nt+SO9+vXj7b/19nZmdGelJ6envDrr7+y0mvlypWMGhYNDQ2wa9cu6f9UKxuJ2LNnD8ydO5dVGUXh\nYycbttfJN0ZGRjLb0I0ePbrddfDw8ODHYW/ZsgUKCwshPz8f4uPjYcyYMfDjjz/yFg9bgJrFixcr\nW4V2R0VFBaKiomDOnDmsytXV1cm11hwcHOCHH34AR0dHyrLz589nPAA8c+ZMVnqFhYUx+vTv0qWL\n3Bz2qKgoVnV9//33rPIrCpduqLaiR48e8Omnn7Iu5+bmBt9//z1YWlrCpk2bYOfOnW2gHX+w2uJb\n0lIICwuDpKQksLW1hZSUFAgLC+OswNOnT2HHjh2UeaytrUk/W5cvX86qj7AzIRnYfZ8pKyuTWTgl\nEolg1qxZrOUYGBhATEyMzMyGsrIyKCgoYFQ+ODiY0eKLM2fOsNaN6yyqIUOGsMrP9sfkfUJPTw8m\nTZrEqezo0aPh559/hrVr1/KsFf/QOmxzc3Po378/rFixQqYFw1f/cW1tLWHwo5ZI5mUSoWiXSFBQ\nEMyePVshGQLcsbCwkK6sS0lJYbzDCBG//fabTKybwYMHg5+fH+Py48aNo53Cx9Upjh07lnWZ5cuX\ns8rf3i3s94kBAwbITKzoqNBqKBKJIDU1FbKysqRzKvmehx0XFwcbN26Et2/fKiSHC+rq6h1me7AP\niWfPnoGxsTFUV1dD9+7dwdHREfT19RXaLmzu3LmwcuVK6NKlC6iqqoKFhQV06dKFlYxevXpRjsmU\nlJRw2h2eyzhPZmYmLFiwgHF+LS0t3gceOxO6urq8BV9ra5qamiAmJkY6aUOSaL+q6ZZJmpuby+20\nwTQeNttEtlQVsW2Wpuvp6WFsbCxt2ZZQxcNmszSdKlCSJL2PlJWVYVxcHLq7u0uXHn/33Xe81mFj\nY4PLli3jXP7cuXNoYmJCeE8WL15MWdbS0pKwHJN45UTxxUeNGsVqF3gmu9CzgSg4mSRt3LiRtjzV\nlmsAsvGw+WDp0qWd4n3atWsXp3vEqIXt7e0Nbm5u8O9//xsA2m4e9r/+9S9e5DBl7969nFbI8UFd\nXZ1S6lU2ixYtgpCQEOlc9r1798KSJUt4r4fNlMDWTJw4Efbv3084pY2r3Hnz5sHJkydZl0tLS4P0\n9HROdbYlMTExsG7dOmWr0SlZv3497VRSUuh+CSQbWL548QLFYjGmpaWhnp6eTB6ireiBQwubSp22\naGF/8skntOVaQ9aCCQoKIgyCRaXPmjVrSK8pMzOTtW4dHaJdh/gmPDwcNTQ0OO/p2JKbN2/K6ZuV\nlUVZhqyFDQC4cuVK0nJ1dXVoa2tLWI7PHZHY2pyshT1gwADasvX19aTXxNSebOkMLWwbGxvO94jV\nVaxfvx537tzJOB62qqoqLw6bKD5uy9RyKy0yyKL1zZ8/n40JKPVgs3nn7du3CaPTdcQHTBGam5ux\nsbERR48eTXqd06dPx8bGRkY/vFS0jH5oZGTEi/4t9bx9+zZtfiqHDUAeD5vsXRGJRBgXF8dY3/bs\nEmFCamoq5XOuq6uLTU1NrHQi4/Dhw6T1KDtWvISQkBDOfhCRpkuktrYWqqqqAODdaq/ExERwdnZm\nPA+bKAZuW8Bk0JBsOfnz58/h6tWr0vTgwQPOevj5+cGFCxcY5XV2doYTJ05QxojgOy5xe3P16lX4\n5ptvQE1NjTLQ0okTJ0BNTQ1iY2Npt5Ijo7a2FvLz86X/nzx5kvVqQTroZhPl5uZSdnX17duXdBFP\ncnKy3DFNTU3YvHkzBAcHs9KzIzF69GjKePevXr3iZYFYcXExZT36+vqEC6vam7Vr14KpqSl3AVS/\nBnl5eSgWi1EsFmO/fv1wy5YtiMg8HnZFRQXrTT2JoGth01yGVB8mqV+/fpTdEUxkHDx4kFYfCVQt\nGDMzM7x8+TJjWR2FpKQkjIqKYnXfJUlFRQWjoqIwKioKHz9+zLjOL7/8Uk6Ws7MzRkVF4bNnzzhd\nx759+6SyvLy8pF+VZBw+fBh1dXVJr83T0xNzc3Plyh09epQwv5GRESYnJ7PSuaO1sBERExMT0cLC\nglSOg4MDRkVF4dOnT1np1pJXr17h7Nmzad9tZXL79m3GOwORQWv1ly9f4rRp09De3h4dHBwwIyMD\ny8vL0dvbm9ZhIyIWFBTguHHjGL+w8+bNk5PFxGHTBfFn4zTEYjHeu3ePsxwdHR3cs2cPnWkxOTmZ\n8kEGeLfBQ2BgoEy6ePEirWxlkZKSQtstwDSNGjUKAwMDaeuk+8z09vbGwMBAXLduHatrOXHihFTG\n4MGDGW1ocObMGUpdRo4cKfO+xMXFYY8ePUjzW1lZYWpqKmOdO6LDRkScOHEirV5jxozBt2/fspIr\nYfLkybTydXV1MTAwkJU9+eLJkyc4aNAgxs8+GbRWnzNnjnTQo6GhASsrK3HVqlW4bds2RESMiIjA\n1atXywv+/0oTEhIoWx1ERm1NXV0d6a4VktS1a1fcunUr4TUwuZkt0z/+8Q+sra2VkzNq1CjGMnR0\ndPDo0aOUtq2srMSpU6eydmS9evVi1J/a3uTk5NDu/sIlWVtbkw7YBQcH044FtHxGrK2tpYmOgQMH\nSstu3ryZkTPx9PSk1cPc3Fyan2qQDABwxowZ+OrVK8b3gM5hX7t2jbEsRP4cdnFxMSM/YGVlhZMn\nT2Ylm2gwmyoZGBigtbU1Dhw4kFU9inDr1i1GukkaH2RQWr2yshItLCzkjjOdhy0hLCwM1dTUaJUl\nm5PZ2NiI27dvpyyrpqZG2YKiasW0Tt26dcMrV64QG4yhDA0NDUaDRUuWLGE9OKupqcn76LqiVFRU\ncHLGHSX9/vvvWF5eLpMkz4yGhgYeOnSIsS2Y1KeiooILFy5ERCQdkLW3t8fq6mpW94HOYbee4UVF\nXV0d+vv7E8rp2bMnK70QEa2srBjfj6VLlzKWa29vz+mei0Qi9Pf3Z30dXGCij5qaGv7zn/9EAI4O\nOysrCwcPHozBwcHo6uqK8+fPx+rqapmb3tzcTPgQtKz03r17tFNZnJycsKCggFAPJl0iw4YNozTY\n27dvGfUfGRgYYHx8PKkcJi0oTU1N3LlzJ6U+LWEyzQfg3Q+Jr68vnjlzhrHs9mLatGmk9vT19UVf\nX1/GLeGOkry9vdHX1xcjIyNZ2WLMmDGUcseMGYPBwcEyZYh2K9fW1saoqChWdf/444+k9Q4YMAAr\nKysZyyLbcxQAsLGxkZVeEpjYvXv37qwXU1HNQiJLM2bM4HQNbElMTGSkj7u7OyJS7+lI6bD/+9//\nopqamnQQbvny5fjVV18xnocdHh6O4eHhOHPmTNTX1ydV1MXFBa9fv06qBxOH/cMPP1BdCiISryRr\n7RCPHDlWsAEYAAAKGElEQVRCKePVq1c4ffp0SjnR0dG0urQkOjqa0Q1l6zjak4iICFp7bty4ERcs\nWKB0R8wkBQQEEHaLMaGqqgpnzJghJ3Po0KEYHh6OVVVVcmXq6+vlBs1iYmJY1x0aGkp5TTU1NYxl\ntYXDpusCAgDcu3cva7kVFRXo5+fH6h63F5s3b2akz6RJkzA8PJxSN0qtS0pKZPrb0tPTcfz48Whv\nb89oHnZL0tPT5RSMj4/HhIQEvHXrFuUF0znshIQEyvISSktLMSEhQaZvsmUi2127NcXFxZiQkIB2\ndnYy5detW8dYl9YkJCRgQkIC6Y8Bm09yZUDUP0lkz8rKSvzss88UcqZdu3aV2kuSqBZWsU0zZszA\n58+fK2SPkpIS9PPzwx07dmBCQgI6ODiQDmRLKC0txVmzZiEA4PHjxznV6+XlRXpd69atYzWoR+Ww\nx48fz0m/qqoqXLFiBalcNmsZWlNUVIQTJkxgdI/37dvHuR4u0Olz9uxZmbykcugqGjlyJD58+BAR\n360iW7VqFa5atQojIiIQEXHr1q2Ug44tKS0tlUlMF0o0NDTgtm3b5C5yw4YNWFpaykhGS169eoXd\nu3eXyomOjuYk5+XLlzLXw3WEuyXV1dVSeRL9Dhw4wNvigrairKxM7v6QDezV1tZiaWkpabwOutQ6\ntg3iu2dEYjdFZqkMHz6cVSuUiurqaqyvr0dEZNwVUVNTw+rdaM3t27cJr2vZsmWsn08qhy0Zw+JC\nbGysnLywsDCFrltCVVWVzLtDlKKjozl/IXDBzMyM9rlrqY9CDvvmzZvo5uaG/fv3x6lTp2JlZSXj\nedh8ExoaipaWltK+HoGOg1gslnsIycYkWvLXX3+hpaWlTOrbty/hQ92rVy+8f/8+I33Ky8vR0tIS\n9fT0aF8WfX19tLS0VNQEHYKGhgbCrw09PT1MSkpiJYvMYVON8bBh3LhxaGlpiSEhIbzIa8mbN2+w\nT58+Mnp36dIFw8LCeK+LCqKxiZapZ8+ecus+FHLYXGkLhy3QsfH395eZa0o0p54J9+/fR39/f7nE\n1uEgIh45ckRavvXLoqmpif7+/njhwgVOenZEysvLcfjw4XLXSvQVTAeZw/by8urwX3yI8mNW7b1w\n5s6dO9i/f39SZ21sbEw4gYCzw37w4AG6uLhIU/fu3TE6OprVwhmBD4s7d+5gZGRkhxwgleglSUwG\nqjsjOTk5crMm3Nzc8M6dO6zkEDnsiRMnYlFRURtpzi9Pnz7FyMhIdHBwQID2HweiWzsyZcoUwnK8\ntLCbmprQyMgInz59ymrhjICAQPuSkpIi1600c+ZM1gGQiBy2ZOyqM5GdnU26rqKtuHbtGpqampI6\naz09PcrV1GSoAUOSk5PB2toaTE1N4ezZs9JgPkFBQeDh4aHwrjMCAgL8UFZWBsXFxdL/R4wYAYcO\nHQJVVVUlaqU86DZhbgsqKyuhsLCQ9PyLFy847XTF2GHHx8dDQEAAADDfwKDllu0eHh7g4eHBWkEB\nAQHFEIlEvDnrXbt2gY+PD7i6uvIi732kuroaxo0bR3o+JydHxlmnpqZCamoqI9mMHHZ9fT2cO3cO\ntm3bJndOshcZES0dtoCAQOfG3Nwc9u/fLzhrGg4cOEB6ztPTE3R1dWWOtW7MbtiwgbQ8o22CL1y4\nAAMHDpTGbjY0NJTuoF5SUsJpg1EBAYHOhVgsBh8fH2Wr0eFZsWIF6bmgoCCF/CUjh33s2DFpdwgA\nMN7AQEBAQOBDYtq0aaTnZsyYAVOmTFFIPq3DrqmpgeTkZPDz85MeCwsLg6SkJLC1tYWUlBQICwtT\nSAkBYpj2awlQ86HbsampCerr65WtxgfBs2fP5I5pamqCu7s7LF68GPT09BSST+uwtbS0oKysDHR0\ndKTHevbsCcnJyZCTkwOJiYkKKyFAzIfuaPjiQ7djdnY2HDx4UGE5ZWVl8PTpU8UVeo/JyMgAT09P\nEIvFYG5uDp6enlBWVgYZGRm8PIeMZ4kICAh0PkQiEURERMDChQsVlvXixQvIz8+Hjz76iAfN3l9S\nUlLg0aNH8OTJE/D29uZVNqM+bAEBgc7DsGHDYNKkSQAAoKKiAosWLeIkZ9y4cdLZCxoaGnD48GEY\nPXo0X2q+19jY2PDurAEARP+/soZ/wSRT/QQEBAQEqCFzy23WJdJGvwMCAgICHyxCl4iAgIBAJ0Fw\n2AICAgKdhDZx2BcvXgR7e3uwsbEhXM4u8I65c+eCoaEhODs7S49VVFTA2LFjwdbWFnx8fKCyslJ6\nbuvWrWBjYwP29vaQmJioDJU7HIWFheDp6Qn9+vUDJycn2LVrFwAIdmRLXV0duLu7g4uLCzg6OsKa\nNWsAQLAjF5qamsDV1VU68MurDfkOK9jY2IhWVlaYn5+P9fX1KBaLafey+1BJS0vDGzduoJOTk/QY\nWeja7OxsFIvFWF9fj/n5+WhlZdUpgsi3NSUlJZiVlYWI77aHsrW1xXv37gl25IBka7SGhgZ0d3fH\n9PR0wY4c+OabbzAwMBAnTZqEiPy+07y3sDMzM8Ha2hrMzc1BXV0d/P394cyZM3xX814wcuRI6NGj\nh8yxs2fPQlBQEAC8iztw+vRpAAA4c+YMBAQEgLq6Opibm4O1tTVkZma2u84dDSMjI3BxcQEAAG1t\nbXBwcIBnz54JduSApqYmALwL9tbU1AQ9evQQ7MiSoqIiOH/+PMyfP1868YJPG/LusJ89ewampqbS\n/01MTAiXawoQQxa6tri4GExMTKT5BLvKU1BQAFlZWeDu7i7YkQPNzc3g4uIChoaG0m4mwY7sCA0N\nhR07doCKyv9cK5825N1hC/Ov+YMqdK3kvMA7qqurYdq0aRAdHS0TRgFAsCNTVFRU4ObNm1BUVARp\naWlw6dIlmfOCHalJSEgAAwMDcHV1JZ3WrKgNeXfYxsbGMjstFBYWyvyKCFBDFrq2tV2LiorA2NhY\nKTp2NBoaGmDatGkwe/ZsaeRIwY7c0dXVhQkTJsCff/4p2JEFf/zxB5w9exYsLCwgICAAUlJSYPbs\n2bzakHeH7ebmBo8ePYKCggKor6+Hn376CSZPnsx3Ne8tZKFrJ0+eDPHx8VBfXw/5+fnw6NEjGDx4\nsDJV7RAgIsybNw8cHR1l4hALdmRHWVmZdPbCmzdvICkpCVxdXQU7smDLli1QWFgI+fn5EB8fD2PG\njIEff/yRXxu2xSjp+fPn0dbWFq2srHDLli1tUcV7gb+/P/bp0wfV1dXRxMQEY2Njsby8HL28vAh3\npN+8eTNaWVmhnZ0dXrx4UYmadxzS09NRJBKhWCxGFxcXdHFxwQsXLgh2ZMnt27fR1dUVxWIxOjs7\n4/bt2xERBTtyJDU1VTpLhE8btlksEQEBAQEBfhFWOgoICAh0EgSHLSAgINBJEBy2gICAQCdBcNgC\nAgICnQTBYQsICAh0EgSHLSAgINBJEBy2gICAQCfh/wDLJ9V8FFzLowAAAABJRU5ErkJggg==\n"
      }
     ],
     "prompt_number": 16
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As you can see, a whole page of text only gave rise to a few lines of training text,\n",
      "but that's enough, since there are many scanned pages we can easily get a hold of.\n",
      "\n",
      "You can control the number of lines returned with the `-w` and `-m` arguments;\n",
      "with really short lines, you risk accidentally matching a word in the dictionary\n",
      "and making text line geometry modeling harder, while with really long lines, you\n",
      "will simply not find a lot of examples.\n",
      "\n",
      "Furthermore, after generating training data like this initially, it is easy to\n",
      "verify it by training on it and seeing which lines remain misclassified."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note that you can also use OCRopus itself to perform these steps; even if its performance\n",
      "on some new font is fairly poor, it will pick up a lot of good training data, and that\n",
      "will allow it to improve itself."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}